Search | VHL Regional Portal

1.

Severus: accurate detection and characterization of somatic structural variation in tumor genomes using long reads.

Keskus, Ayse; Bryant, Asher; Ahmad, Tanveer; Yoo, Byunggil; Aganezov, Sergey; Goretsky, Anton; Donmez, Ataberk; Lansdon, Lisa A; Rodriguez, Isabel; Park, Jimin; Liu, Yuelin; Cui, Xiwen; Gardner, Joshua; McNulty, Brandy; Sacco, Samuel; Shetty, Jyoti; Zhao, Yongmei; Tran, Bao; Narzisi, Giuseppe; Helland, Adrienne; Cook, Daniel E; Chang, Pi-Chuan; Kolesnikov, Alexey; Carroll, Andrew; Molloy, Erin K; Pushel, Irina; Guest, Erin; Pastinen, Tomi; Shafin, Kishwar; Miga, Karen H; Malikic, Salem; Day, Chi-Ping; Robine, Nicolas; Sahinalp, Cenk; Dean, Michael; Farooqi, Midhat S; Paten, Benedict; Kolmogorov, Mikhail.

medRxiv ; 2024 Mar 26.

Article in English | MEDLINE | ID: mdl-38585974

ABSTRACT

Most current studies rely on short-read sequencing to detect somatic structural variation (SV) in cancer genomes. Long-read sequencing offers the advantage of better mappability and long-range phasing, which results in substantial improvements in germline SV detection. However, current long-read SV detection methods do not generalize well to the analysis of somatic SVs in tumor genomes with complex rearrangements, heterogeneity, and aneuploidy. Here, we present Severus: a method for the accurate detection of different types of somatic SVs using a phased breakpoint graph approach. To benchmark various short- and long-read SV detection methods, we sequenced five tumor/normal cell line pairs with Illumina, Nanopore, and PacBio sequencing platforms; on this benchmark Severus showed the highest F1 scores (harmonic mean of the precision and recall) as compared to long-read and short-read methods. We then applied Severus to three clinical cases of pediatric cancer, demonstrating concordance with known genetic findings as well as revealing clinically relevant cryptic rearrangements missed by standard genomic panels.
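The F1 score mentioned in the abstract above (the harmonic mean of precision and recall) can be sketched in a few lines of Python. The function name and the counts below are hypothetical and chosen for illustration only; they are not results from the paper.

```python
def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Harmonic mean of precision and recall; returns 0.0 when there are no true positives."""
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)  # fraction of calls that are real
    recall = true_pos / (true_pos + false_neg)     # fraction of real events that were called
    return 2 * precision * recall / (precision + recall)


# Hypothetical call set: 90 correct SV calls, 10 false calls, 30 missed SVs.
score = f1_score(90, 10, 30)
print(round(score, 3))
```

Because the harmonic mean is dominated by the smaller of its two inputs, a caller cannot reach a high F1 by maximizing only precision or only recall.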

2.

Genetic and behavioral differences between above and below ground Culex pipiens bioforms.

Bell, Katherine L; Noreuil, Anna; Molloy, Erin K; Fritz, Megan L.

Heredity (Edinb) ; 2024 Feb 29.

Article in English | MEDLINE | ID: mdl-38424351

ABSTRACT

Efficiency of mosquito-borne disease transmission is dependent upon both the preference and fidelity of mosquitoes as they seek the blood of vertebrate hosts. While mosquitoes select their blood hosts through multi-modal integration of sensory cues, host-seeking is primarily an odor-guided behavior. Differences in mosquito responses to hosts and their odors have been demonstrated to have a genetic component, but the underlying genomic architecture of these responses has yet to be fully resolved. Here, we provide the first characterization of the genomic architecture of host preference in the polymorphic mosquito species, Culex pipiens. The species exists as two morphologically identical bioforms, each with distinct avian and mammalian host preferences. Cx. pipiens females with empirically measured host responses were prepared into reduced representation DNA libraries and sequenced to identify genomic regions associated with host preference. Multiple genomic regions associated with host preference were identified on all 3 Culex chromosomes, and these genomic regions contained clusters of chemosensory genes, as expected based on work in Anopheles gambiae complex mosquitoes and in Aedes aegypti. One odorant receptor and one odorant binding protein gene showed one-to-one orthologous relationships to differentially expressed genes in A. gambiae complex members with divergent host preferences. Overall, our work identifies a distinct set of odorant receptors and odorant binding proteins that may enable Cx. pipiens females to distinguish between their vertebrate blood host species, and opens avenues for future functional studies that could measure the unique contributions of each gene to host preference phenotypes.

3.

Dollo-CDP: a polynomial-time algorithm for the clade-constrained large Dollo parsimony problem.

Dai, Junyan; Rubel, Tobias; Han, Yunheng; Molloy, Erin K.

Algorithms Mol Biol ; 19(1): 2, 2024 Jan 08.

Article in English | MEDLINE | ID: mdl-38191515

ABSTRACT

The last decade of phylogenetics has seen the development of many methods that leverage constraints plus dynamic programming. The goal of this algorithmic technique is to produce a phylogeny that is optimal with respect to some objective function and that lies within a constrained version of tree space. The popular species tree estimation method ASTRAL, for example, returns a tree that (1) maximizes the quartet score computed with respect to the input gene trees and that (2) draws its branches (bipartitions) from the input constraint set. This technique has yet to be used for parsimony problems where the inputs are binary characters, sometimes with missing values. Here, we introduce the clade-constrained character parsimony problem and present an algorithm that solves this problem for the Dollo criterion score in [Formula: see text] time, where n is the number of leaves, k is the number of characters, and [Formula: see text] is the set of clades used as constraints. Dollo parsimony, which requires traits/mutations to be gained at most once but allows them to be lost any number of times, is widely used for tumor phylogenetics as well as species phylogenetics, for example in analyses of low-homoplasy retroelement insertions across the vertebrate tree of life. This motivated us to implement our algorithm in a software package, called Dollo-CDP, and evaluate its utility for analyzing retroelement insertion presence/absence patterns for bats, birds, and toothed whales, as well as simulated data. Our results show that Dollo-CDP can improve upon heuristic search from a single starting tree, often recovering a better scoring tree. Moreover, Dollo-CDP scales to data sets with much larger numbers of taxa than branch-and-bound while still having an optimality guarantee, albeit a more restricted one. Lastly, we show that our algorithm for Dollo parsimony can easily be adapted to Camin-Sokal parsimony but not Fitch parsimony.
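To make the Dollo criterion concrete, here is a minimal Python sketch that counts the losses one binary character requires on a fixed rooted tree. This is not the Dollo-CDP algorithm, which searches over trees. It assumes no missing values, an absent (0) ancestral state, and a hypothetical nested-tuple tree encoding. The gain is placed at the last common ancestor of the leaves carrying the character, and each maximal clade below it with no such leaf costs one loss.

```python
def dollo_losses(tree, ones):
    """Count losses for one binary character on a rooted tree under Dollo parsimony.

    tree: nested tuples of leaf names (hypothetical encoding), e.g. (("A", "B"), "C").
    ones: leaf names where the character is present; assumed to all occur in the tree.
    """
    ones = set(ones)
    if not ones:
        return 0  # character never gained, so nothing can be lost
    result = None

    def visit(node):
        # Returns (number of 1-leaves below node, losses below node if node is in state 1).
        nonlocal result
        if isinstance(node, str):
            k, losses = (1, 0) if node in ones else (0, 0)
        else:
            k, losses = 0, 0
            for child in node:
                child_k, child_losses = visit(child)
                k += child_k
                # A child clade with no 1-leaves is lost once, on the edge leading to it.
                losses += child_losses if child_k > 0 else 1
        # Post-order: the first node covering every 1-leaf is their last common ancestor.
        if k == len(ones) and result is None:
            result = losses
        return k, losses

    visit(tree)
    return result


tree = ((("A", "B"), "C"), ("D", "E"))
print(dollo_losses(tree, {"A", "C"}))  # gain above (A,B,C), one loss on B
print(dollo_losses(tree, {"A", "D"}))  # gain at the root, losses on B, C and E
```

Summing this count over all characters gives a Dollo score for the tree under these assumptions; the paper's exact scoring and its treatment of missing values may differ.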

4.

Quartets enable statistically consistent estimation of cell lineage trees under an unbiased error and missingness model.

Han, Yunheng; Molloy, Erin K.

Algorithms Mol Biol ; 18(1): 19, 2023 Dec 01.

Article in English | MEDLINE | ID: mdl-38041123

ABSTRACT

Cancer progression and treatment can be informed by reconstructing a cancer's evolutionary history from tumor cells. Although many methods exist to estimate evolutionary trees (called phylogenies) from molecular sequences, traditional approaches assume the input data are error-free and the output tree is fully resolved. These assumptions are challenged in tumor phylogenetics because single-cell sequencing produces sparse, error-ridden data and because tumors evolve clonally. Here, we study the theoretical utility of methods based on quartets (four-leaf, unrooted phylogenetic trees) in light of these barriers. We consider a popular tumor phylogenetics model, in which mutations arise on a (highly unresolved) tree and then (unbiased) errors and missing values are introduced. Quartets are then implied by mutations present in two cells and absent from two cells. Our main result is that the most probable quartet identifies the unrooted model tree on four cells. This motivates seeking a tree such that the number of quartets shared between it and the input mutations is maximized. We prove an optimal solution to this problem is a consistent estimator of the unrooted cell lineage tree; this guarantee includes the case where the model tree is highly unresolved, with error defined as the number of false negative branches. Lastly, we outline how quartet-based methods might be employed when there are copy number aberrations and other challenges specific to tumor phylogenetics.
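The way a mutation implies a quartet (present in two cells, absent from the other two) can be illustrated with a short Python sketch. The matrix layout, cell names, and function name are hypothetical, and using the most frequent observed quartet as a stand-in for the "most probable quartet" is a simplifying assumption for illustration, not the paper's estimator.

```python
from collections import Counter


def quartet_counts(matrix, cells):
    """Count unrooted quartets implied by mutations on exactly four cells.

    matrix: dict mapping cell name -> list of 0, 1, or None (missing), one entry per mutation.
    cells: the four cell names to examine.
    """
    counts = Counter()
    n_mutations = len(matrix[cells[0]])
    for j in range(n_mutations):
        states = [matrix[cell][j] for cell in cells]
        if None in states:
            continue  # skip mutations with a missing value in any of the four cells
        present = tuple(sorted(c for c, s in zip(cells, states) if s == 1))
        absent = tuple(sorted(c for c, s in zip(cells, states) if s == 0))
        if len(present) != 2:
            continue  # only 2-versus-2 patterns are informative for an unrooted quartet
        # ab|cd and cd|ab are the same unrooted quartet, so store the two sides in sorted order.
        counts[tuple(sorted([present, absent]))] += 1
    return counts


matrix = {
    "c1": [1, 1, 0, 1, None],
    "c2": [1, 1, 0, 0, 1],
    "c3": [0, 0, 1, 1, 1],
    "c4": [0, 0, 1, 0, 0],
}
counts = quartet_counts(matrix, ["c1", "c2", "c3", "c4"])
print(counts.most_common(1))  # the quartet supported by the most mutations
```

In this toy matrix three mutations support c1,c2|c3,c4 and one (possibly an error) supports c1,c3|c2,c4, so the majority quartet is the first.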

5.

Single-cell methylation sequencing data reveal succinct metastatic migration histories and tumor progression models.

Liu, Yuelin; Li, Xuan Cindy; Rashidi Mehrabadi, Farid; Schäffer, Alejandro A; Pratt, Drew; Crawford, David R; Malikic, Salem; Molloy, Erin K; Gopalan, Vishaka; Mount, Stephen M; Ruppin, Eytan; Aldape, Kenneth D; Sahinalp, S Cenk.

Genome Res ; 33(7): 1089-1100, 2023 07.

Article in English | MEDLINE | ID: mdl-37316351

ABSTRACT

Recent studies exploring the impact of methylation in tumor evolution suggest that although the methylation status of many CpG sites is preserved across distinct lineages, that of others is altered as the cancer progresses. Because changes in methylation status of a CpG site may be retained in mitosis, they could be used to infer the progression history of a tumor via single-cell lineage tree reconstruction. In this work, we introduce the first principled distance-based computational method, Sgootr, for inferring a tumor's single-cell methylation lineage tree and for jointly identifying lineage-informative CpG sites that harbor changes in methylation status that are retained along the lineage. We apply Sgootr on single-cell bisulfite-treated whole-genome sequencing data of multiregionally sampled tumor cells from nine metastatic colorectal cancer patients, as well as multiregionally sampled single-cell reduced-representation bisulfite sequencing data from a glioblastoma patient. We show that the tumor lineages constructed reveal a simple model underlying tumor progression and metastatic seeding. A comparison of Sgootr against alternative approaches shows that Sgootr can construct lineage trees with fewer migration events and in greater concordance with the sequential-progression model of tumor evolution, with a running time that is a fraction of that used in prior studies. Lineage-informative CpG sites identified by Sgootr are in inter-CpG island (CGI) regions, as opposed to intra-CGIs, which have been the main regions of interest in genomic methylation-related analyses.

Subjects

DNA Methylation , Neoplasms , Humans , DNA Methylation/genetics , Sulfites , Sequence Analysis, DNA/methods , Genome , Neoplasms/genetics , CpG Islands/genetics

6.

Improving quartet graph construction for scalable and accurate species tree estimation from gene trees.

Han, Yunheng; Molloy, Erin K.

Genome Res ; 33(7): 1042-1052, 2023 07.

Article in English | MEDLINE | ID: mdl-37197990

ABSTRACT

Summary methods are widely used to estimate species trees from genome-scale data. However, they can fail to produce accurate species trees when the input gene trees are highly discordant because of estimation error and biological processes, such as incomplete lineage sorting. Here, we introduce TREE-QMC, a new summary method that offers accuracy and scalability under these challenging scenarios. TREE-QMC builds upon weighted Quartet Max Cut (wQMC), which takes weighted quartets as input and then constructs a species tree in a divide-and-conquer fashion, at each step forming a graph and seeking its max cut. The wQMC method has been successfully leveraged in the context of species tree estimation by weighting quartets by their frequencies in the gene trees; we improve upon this approach in two ways. First, we address accuracy by normalizing the quartet weights to account for "artificial taxa" introduced during the divide phase so subproblem solutions can be combined during the conquer phase. Second, we address scalability by introducing an algorithm to construct the graph directly from the gene trees; this gives TREE-QMC a time complexity of [Formula: see text], where n is the number of species and k is the number of gene trees, assuming the subproblem decomposition is perfectly balanced. These contributions enable TREE-QMC to be highly competitive in terms of species tree accuracy and empirical runtime with the leading quartet-based methods, even outperforming them on some model conditions explored in our simulation study. We also present the application of these methods to an avian phylogenomics data set.

Subjects

Algorithms , Genome , Phylogeny , Computer Simulation , Models, Genetic

7.

Assessment of plasmids for relating the 2020 Salmonella enterica serovar Newport onion outbreak to farms implicated by the outbreak investigation.

Commichaux, Seth; Rand, Hugh; Javkar, Kiran; Molloy, Erin K; Pettengill, James B; Pightling, Arthur; Hoffmann, Maria; Pop, Mihai; Jayeola, Victor; Foley, Steven; Luo, Yan.

BMC Genomics ; 24(1): 165, 2023 Apr 04.

Article in English | MEDLINE | ID: mdl-37016310

ABSTRACT

BACKGROUND: The Salmonella enterica serovar Newport red onion outbreak of 2020 was the largest foodborne outbreak of Salmonella in over a decade. The epidemiological investigation suggested two farms as the likely source of contamination. However, single nucleotide polymorphism (SNP) analysis of the whole genome sequencing data showed that none of the Salmonella isolates collected from the farm regions were linked to the clinical isolates, preventing the use of phylogenetics in source identification. Here, we explored an alternative method for analyzing the whole genome sequencing data driven by the hypothesis that if the outbreak strain had come from the farm regions, then the clinical isolates would disproportionately contain plasmids found in isolates from the farm regions due to horizontal transfer. RESULTS: SNP analysis confirmed that the clinical isolates formed a single, nearly clonal clade with evidence for ancestry in California going back a decade. The clinical clade had a large core genome (4,399 genes) and a large and sparsely distributed accessory genome (2,577 genes, at least 64% on plasmids). At least 20 plasmid types occurred in the clinical clade, more than were found in the literature for Salmonella Newport. A small number of plasmids, 14 from 13 clinical isolates and 17 from 8 farm isolates, were found to be highly similar (>95% identical), indicating they might be related by horizontal transfer. Phylogenetic analysis was unable to determine the geographic origin, isolation source, or time of transfer of the plasmids, likely due to their promiscuous and transient nature. However, our resampling analysis suggested that observing a similar number and combination of highly similar plasmids in random samples of environmental Salmonella enterica within the NCBI Pathogen Detection database was unlikely, supporting a connection between the outbreak strain and the farms implicated by the epidemiological investigation.
CONCLUSION: Horizontally transferred plasmids provided evidence for a connection between clinical isolates and the farms implicated as the source of the outbreak. Our case study suggests that such analyses might add a new dimension to source tracking investigations, but highlights the need for detailed and accurate metadata, more extensive environmental sampling, and a better understanding of plasmid molecular evolution.

Subjects

Salmonella enterica , Serogroup , Onions/genetics , Farms , Phylogeny , Plasmids/genetics , Disease Outbreaks

8.

Inferring population structure in biobank-scale genomic data.

Chiu, Alec M; Molloy, Erin K; Tan, Zilong; Talwalkar, Ameet; Sankararaman, Sriram.

Am J Hum Genet ; 109(4): 727-737, 2022 04 07.

Article in English | MEDLINE | ID: mdl-35298920

ABSTRACT

Inferring the structure of human populations from genetic variation data is a key task in population and medical genomic studies. Although a number of methods for population structure inference have been proposed, current methods are impractical to run on biobank-scale genomic datasets containing millions of individuals and genetic variants. We introduce SCOPE, a method for population structure inference that is orders of magnitude faster than existing methods while achieving comparable accuracy. SCOPE infers population structure in about a day on a dataset containing one million individuals and variants as well as on the UK Biobank dataset containing 488,363 individuals and 569,346 variants. Furthermore, SCOPE can leverage allele frequencies from previous studies to improve the interpretability of population structure estimates.

Subjects

Biological Specimen Banks , Genetics, Population , Gene Frequency/genetics , Genomics , Humans

9.

Theoretical and Practical Considerations when using Retroelement Insertions to Estimate Species Trees in the Anomaly Zone.

Molloy, Erin K; Gatesy, John; Springer, Mark S.

Syst Biol ; 71(3): 721-740, 2022 04 19.

Article in English | MEDLINE | ID: mdl-34677617

ABSTRACT

A potential shortcoming of concatenation methods for species tree estimation is their failure to account for incomplete lineage sorting. Coalescent methods address this problem but make various assumptions that, if violated, can result in worse performance than concatenation. Given the challenges of analyzing DNA sequences with both concatenation and coalescent methods, retroelement insertions (RIs) have emerged as powerful phylogenomic markers for species tree estimation. Here, we show that two recently proposed quartet-based methods, SDPquartets and ASTRAL_BP, are statistically consistent estimators of the unrooted species tree topology under the coalescent when RIs follow a neutral infinite-sites model of mutation and the expected number of new RIs per generation is constant across the species tree. The accuracy of these (and other) methods for inferring species trees from RIs has yet to be assessed on simulated data sets, where the true species tree topology is known. Therefore, we evaluated eight methods given RIs simulated from four model species trees, all of which have short branches and at least three of which are in the anomaly zone. In our simulation study, ASTRAL_BP and SDPquartets always recovered the correct species tree topology when given a sufficiently large number of RIs, as predicted. A distance-based method (ASTRID_BP) and Dollo parsimony also performed well in recovering the species tree topology. In contrast, unordered, polymorphism, and Camin-Sokal parsimony (as well as an approach based on MDC) typically fail to recover the correct species tree topology in anomaly zone situations with more than four ingroup taxa. Of the methods studied, only ASTRAL_BP automatically estimates internal branch lengths (in coalescent units) and support values (i.e., local posterior probabilities). We examined the accuracy of branch length estimation, finding that estimated lengths were accurate for short branches but upwardly biased otherwise. 
This led us to derive the maximum likelihood (branch length) estimate for when RIs are given as input instead of binary gene trees; this corrected formula produced accurate estimates of branch lengths in our simulation study provided that a sufficiently large number of RIs were given as input. Lastly, we evaluated the impact of data quantity on species tree estimation by repeating the above experiments with input sizes varying from 100 to 100,000 parsimony-informative RIs. We found that, when given just 1000 parsimony-informative RIs as input, ASTRAL_BP successfully reconstructed major clades (i.e., clades separated by branches >0.3 coalescent units) with high support and identified rapid radiations (i.e., shorter connected branches), although not their precise branching order. The local posterior probability was effective for controlling false positive branches in these scenarios. [Coalescence; incomplete lineage sorting; Laurasiatheria; Palaeognathae; parsimony; polymorphism parsimony; retroelement insertions; species trees; transposon.]

Subjects

Palaeognathae , Retroelements , Animals , Computer Simulation , Models, Genetic , Phylogeny , Retroelements/genetics

10.

Corrigendum to: ASTRAL-Pro: Quartet-Based Species-Tree Inference despite Paralogy.

Zhang, Chao; Scornavacca, Celine; Molloy, Erin K; Mirarab, Siavash.

Mol Biol Evol ; 38(10): 4655, 2021 Sep 27.

Article in English | MEDLINE | ID: mdl-34417619

11.

Advancing admixture graph estimation via maximum likelihood network orientation.

Molloy, Erin K; Durvasula, Arun; Sankararaman, Sriram.

Bioinformatics ; 37(Suppl_1): i142-i150, 2021 07 12.

Article in English | MEDLINE | ID: mdl-34252951

ABSTRACT

MOTIVATION: Admixture, the interbreeding between previously distinct populations, is a pervasive force in evolution. The evolutionary history of populations in the presence of admixture can be modeled by augmenting phylogenetic trees with additional nodes that represent admixture events. While enabling a more faithful representation of evolutionary history, admixture graphs present formidable inferential challenges, and there is an increasing need for methods that are accurate, fully automated, and computationally efficient. One key challenge arises from the size of the space of admixture graphs. Given that exhaustively evaluating all admixture graphs can be prohibitively expensive, heuristics have been developed to enable efficient search over this space. One heuristic, implemented in the popular method TreeMix, consists of adding edges to a starting tree while optimizing a suitable objective function. RESULTS: Here, we present a demographic model (with one admixed population incident to a leaf) where TreeMix and any other starting-tree-based maximum likelihood heuristic using its likelihood function are guaranteed to get stuck in a local optimum and return an incorrect network topology. To address this issue, we propose a new search strategy that we term maximum likelihood network orientation (MLNO). We augment TreeMix with an exhaustive search for an MLNO, referring to this approach as OrientAGraph. In evaluations including previously published admixture graphs, OrientAGraph outperformed TreeMix on 4/8 models (there are no differences in the other cases). Overall, OrientAGraph found graphs with higher likelihood scores and topological accuracy while remaining computationally efficient. Lastly, our study reveals several directions for improving maximum likelihood admixture graph estimation. AVAILABILITY AND IMPLEMENTATION: OrientAGraph is available on GitHub (https://github.com/sriramlab/OrientAGraph) under the GNU General Public License v3.0.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subjects

Algorithms , Software , Humans , Likelihood Functions , Phylogeny , Population Groups

12.

Using Robinson-Foulds supertrees in divide-and-conquer phylogeny estimation.

Yu, Xilin; Le, Thien; Christensen, Sarah A; Molloy, Erin K; Warnow, Tandy.

Algorithms Mol Biol ; 16(1): 12, 2021 Jun 28.

Article in English | MEDLINE | ID: mdl-34183037

ABSTRACT

One of the Grand Challenges in Science is the construction of the Tree of Life, an evolutionary tree containing several million species, spanning all life on earth. However, the construction of the Tree of Life is enormously computationally challenging, as all the current most accurate methods are either heuristics for NP-hard optimization problems or Bayesian MCMC methods that sample from tree space. One of the most promising approaches for improving scalability and accuracy for phylogeny estimation uses divide-and-conquer: a set of species is divided into overlapping subsets, trees are constructed on the subsets, and then merged together using a "supertree method". Here, we present Exact-RFS-2, the first polynomial-time algorithm to find an optimal supertree of two trees, using the Robinson-Foulds Supertree (RFS) criterion (a major approach in supertree estimation that is related to maximum likelihood supertrees), and we prove that finding the RFS of three input trees is NP-hard. Exact-RFS-2 is available in open source form on GitHub at https://github.com/yuxilin51/GreedyRFS.

13.

TIPP2: metagenomic taxonomic profiling using phylogenetic markers.

Shah, Nidhi; Molloy, Erin K; Pop, Mihai; Warnow, Tandy.

Bioinformatics ; 37(13): 1839-1845, 2021 Jul 27.

Article in English | MEDLINE | ID: mdl-33471121

ABSTRACT

MOTIVATION: Metagenomics has revolutionized microbiome research by enabling researchers to characterize the composition of complex microbial communities. Taxonomic profiling is one of the critical steps in metagenomic analyses. Marker genes, which are single-copy and universally found across Bacteria and Archaea, can provide accurate estimates of taxon abundances in the sample. RESULTS: We present TIPP2, a marker gene-based abundance profiling method, which combines phylogenetic placement with statistical techniques to control classification precision and recall. TIPP2 includes an updated set of reference packages and several algorithmic improvements over the original TIPP method. We find that TIPP2 provides comparable or better estimates of abundance than other profiling methods (including Bracken, mOTUsv2 and MetaPhlAn2), and strictly dominates other methods when there are under-represented (novel) genomes present in the dataset. AVAILABILITY AND IMPLEMENTATION: The code for our method is freely available in open-source form at https://github.com/smirarab/sepp/blob/tipp2/README.TIPP.md. The code and procedure to create new reference packages for TIPP2 are available at https://github.com/shahnidhi/TIPP_reference_package. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

14.

Using Constrained-INC for Large-Scale Gene Tree and Species Tree Estimation.

Le, Thien; Sy, Aaron; Molloy, Erin K; Zhang, Qiuyi; Rao, Satish; Warnow, Tandy.

IEEE/ACM Trans Comput Biol Bioinform ; 18(1): 2-15, 2021.

Article in English | MEDLINE | ID: mdl-32750844

ABSTRACT

Incremental tree building (INC) is a new phylogeny estimation method that has been proven to be absolute fast converging under standard sequence evolution models. A variant of INC, called Constrained-INC, is designed for use in divide-and-conquer pipelines for phylogeny estimation where a set of species is divided into disjoint subsets, trees are computed on the subsets using a selected base method, and then the subset trees are combined together. We evaluate the accuracy of INC and Constrained-INC for gene tree and species tree estimation on simulated datasets, and compare them to similar pipelines using NJMerge (another method that merges disjoint trees). For gene tree estimation, we find that INC has very poor accuracy in comparison to standard methods, and even Constrained-INC (using maximum likelihood methods to compute constraint trees) does not match the accuracy of the better maximum likelihood methods. Results for species trees are somewhat different, with Constrained-INC coming close to the accuracy of the best species tree estimation methods, while being much faster; furthermore, using Constrained-INC allows species tree estimation methods to scale to large datasets within limited computational resources. Overall, this study exposes the benefits and limitations of divide-and-conquer strategies for large-scale phylogenetic tree estimation.

Subjects

Computational Biology/methods , Evolution, Molecular , Phylogeny , Sequence Alignment/methods , Algorithms , Databases, Genetic , Genes/genetics , Models, Statistical

15.

Polynomial-Time Statistical Estimation of Species Trees Under Gene Duplication and Loss.

Legried, Brandon; Molloy, Erin K; Warnow, Tandy; Roch, Sébastien.

J Comput Biol ; 28(5): 452-468, 2021 05.

Article in English | MEDLINE | ID: mdl-33325781

ABSTRACT

Phylogenomics, the estimation of species trees from multilocus data sets, is a common step in many biological studies. However, this estimation is challenged by the fact that genes can evolve under processes, including incomplete lineage sorting (ILS) and gene duplication and loss (GDL), that make their trees different from the species tree. In this article, we address the challenge of estimating the species tree under GDL. We show that species trees are identifiable under a standard stochastic model for GDL, and that the polynomial-time algorithm ASTRAL-multi, a recent development in the ASTRAL suite of methods, is statistically consistent under this GDL model. We also provide a simulation study evaluating ASTRAL-multi for species tree estimation under GDL.

Subjects

Computational Biology/methods , Gene Deletion , Gene Duplication , Algorithms , Genetic Speciation , Models, Genetic , Phylogeny

16.

ASTRAL-Pro: Quartet-Based Species-Tree Inference despite Paralogy.

Zhang, Chao; Scornavacca, Celine; Molloy, Erin K; Mirarab, Siavash.

Mol Biol Evol ; 37(11): 3292-3307, 2020 11 01.

Article in English | MEDLINE | ID: mdl-32886770

ABSTRACT

Phylogenetic inference from genome-wide data (phylogenomics) has revolutionized the study of evolution because it enables accounting for discordance among evolutionary histories across the genome. To this end, summary methods have been developed to allow accurate and scalable inference of species trees from gene trees. However, most of these methods, including the widely used ASTRAL, can only handle single-copy gene trees and do not attempt to model gene duplication and gene loss. As a result, most phylogenomic studies have focused on single-copy genes and have discarded large parts of the data. Here, we first propose a measure of quartet similarity between single-copy and multicopy trees that accounts for orthology and paralogy. We then introduce a method called ASTRAL-Pro (ASTRAL for PaRalogs and Orthologs) to find the species tree that optimizes our quartet similarity measure using dynamic programming. By studying its performance on an extensive collection of simulated data sets and on real data sets, we show that ASTRAL-Pro is more accurate than alternative methods.

Subjects

Genetic Techniques , Phylogeny , Algorithms , Plants/genetics , Yeasts/genetics

17.

FastMulRFS: fast and accurate species tree estimation under generic gene duplication and loss models.

Molloy, Erin K; Warnow, Tandy.

Bioinformatics ; 36(Suppl_1): i57-i65, 2020 07 01.

Article in English | MEDLINE | ID: mdl-32657396

ABSTRACT

MOTIVATION: Species tree estimation is a basic part of biological research but can be challenging because of gene duplication and loss (GDL), which results in genes that can appear more than once in a given genome. All common approaches in phylogenomic studies either reduce available data or are error-prone, and thus, scalable methods that do not discard data and have high accuracy on large heterogeneous datasets are needed. RESULTS: We present FastMulRFS, a polynomial-time method for estimating species trees without knowledge of orthology. We prove that FastMulRFS is statistically consistent under a generic model of GDL when adversarial GDL does not occur. Our extensive simulation study shows that FastMulRFS matches the accuracy of MulRF (which tries to solve the same optimization problem) and has better accuracy than prior methods, including ASTRAL-multi (the only method to date that has been proven statistically consistent under GDL), while being much faster than both methods. AVAILABILITY AND IMPLEMENTATION: FastMulRFS is available on GitHub (https://github.com/ekmolloy/fastmulrfs). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subjects

Algorithms , Gene Duplication , Biometry , Computer Simulation , Phylogeny

18.

Correction to: The performance of coalescent-based species tree estimation methods under models of missing data.

Nute, Michael; Chou, Jed; Molloy, Erin K; Warnow, Tandy.

BMC Genomics ; 21(1): 133, 2020 02 10.

Article in English | MEDLINE | ID: mdl-32039710

ABSTRACT

After publication of [1], the authors were informed by John A. Rhodes of a counterexample to Theorem 11 of [1].

19.

Non-parametric correction of estimated gene trees using TRACTION.

Christensen, Sarah; Molloy, Erin K; Vachaspati, Pranjal; Yammanuru, Ananya; Warnow, Tandy.

Algorithms Mol Biol ; 15: 1, 2020.

Article in English | MEDLINE | ID: mdl-31911812

ABSTRACT

MOTIVATION: Estimated gene trees are often inaccurate, due to insufficient phylogenetic signal in the single gene alignment, among other causes. Gene tree correction aims to improve the accuracy of an estimated gene tree by using computational techniques along with auxiliary information, such as a reference species tree or sequencing data. However, gene trees and species trees can differ as a result of gene duplication and loss (GDL), incomplete lineage sorting (ILS), and other biological processes. Thus, gene tree correction methods need to take estimation error as well as gene tree heterogeneity into account. Many prior gene tree correction methods have been developed for the case where GDL is present. RESULTS: Here, we study the problem of gene tree correction where gene tree heterogeneity is instead due to ILS and/or horizontal gene transfer (HGT). We introduce TRACTION, a simple polynomial time method that provably finds an optimal solution to the RF-optimal tree refinement and completion (RF-OTRC) problem, which seeks a refinement and completion of a singly-labeled gene tree with respect to a given singly-labeled species tree so as to minimize the Robinson-Foulds (RF) distance. Our extensive simulation study on 68,000 estimated gene trees shows that TRACTION matches or improves on the accuracy of well-established methods from the GDL literature when HGT and ILS are both present, and ties for best under the ILS-only conditions. Furthermore, TRACTION ties for fastest on these datasets. We also show that a naive generalization of the RF-OTRC problem to multi-labeled trees is possible, but can produce misleading results where gene tree heterogeneity is due to GDL.
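The Robinson-Foulds (RF) distance that the problem above minimizes can be sketched as the number of non-trivial bipartitions found in one tree but not the other. The Python below is an illustrative sketch only and is not TRACTION: it assumes both trees are singly labeled, share the same leaf set, and are given in a hypothetical nested-tuple encoding, and it omits the refinement and completion steps entirely.

```python
def bipartitions(tree):
    """Return the non-trivial bipartitions of a tree given as nested tuples of leaf names.

    Each bipartition is stored as the frozenset of leaves on the side that does NOT
    contain a fixed reference leaf, so the rooting of the encoding does not matter.
    """
    leaves = set()
    clades = []

    def visit(node):
        if isinstance(node, str):
            leaves.add(node)
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        clades.append(clade)
        return clade

    visit(tree)
    all_leaves = frozenset(leaves)
    reference = min(all_leaves)  # arbitrary but deterministic choice of reference leaf
    splits = set()
    for clade in clades:
        side = all_leaves - clade if reference in clade else clade
        # Keep only splits with at least two leaves on each side (non-trivial).
        if 2 <= len(side) <= len(all_leaves) - 2:
            splits.add(side)
    return splits


def rf_distance(tree1, tree2):
    """Size of the symmetric difference of the two bipartition sets."""
    return len(bipartitions(tree1) ^ bipartitions(tree2))


gene_tree = ((("A", "B"), "C"), ("D", "E"))
species_tree = ((("A", "C"), "B"), ("D", "E"))
print(rf_distance(gene_tree, species_tree))
```

Here the two trees share the split DE|ABC but disagree on whether A groups with B or with C, so each contributes one unmatched bipartition.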

20.

ILS-Aware Analysis of Low-Homoplasy Retroelement Insertions: Inference of Species Trees and Introgression Using Quartets.

Springer, Mark S; Molloy, Erin K; Sloan, Daniel B; Simmons, Mark P; Gatesy, John.

J Hered ; 111(2): 147-168, 2020 04 02.

Article in English | MEDLINE | ID: mdl-31837265

ABSTRACT

DNA sequence alignments have provided the majority of data for inferring phylogenetic relationships with both concatenation and coalescent methods. However, DNA sequences are susceptible to extensive homoplasy, especially for deep divergences in the Tree of Life. Retroelement insertions have emerged as a powerful alternative to sequences for deciphering evolutionary relationships because these data are nearly homoplasy-free. In addition, retroelement insertions satisfy the "no intralocus-recombination" assumption of summary coalescent methods because they are singular events and better approximate neutrality relative to DNA loci commonly sampled in phylogenomic studies. Retroelements have traditionally been analyzed with parsimony, distance, and network methods. Here, we analyze retroelement data sets for vertebrate clades (Placentalia, Laurasiatheria, Balaenopteroidea, Palaeognathae) with 2 ILS-aware methods that operate by extracting, weighting, and then assembling unrooted quartets into a species tree. The first approach constructs a species tree from retroelement bipartitions with ASTRAL, and the second method is based on split-decomposition with parsimony. We also develop a Quartet-Asymmetry test to detect hybridization using retroelements. Both ILS-aware methods recovered the same species-tree topology for each data set. The ASTRAL species trees for Laurasiatheria have consecutive short branch lengths in the anomaly zone whereas Palaeognathae is outside of this zone. For the Balaenopteroidea data set, which includes rorquals (Balaenopteridae) and gray whale (Eschrichtiidae), both ILS-aware methods resolved balaenopterids as paraphyletic. Application of the Quartet-Asymmetry test to this data set detected 19 different quartets of species for which historical introgression may be inferred. Evidence for introgression was not detected in the other data sets.
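The idea behind a quartet asymmetry test can be sketched as follows. Under ILS alone, the two minority resolutions of a four-species quartet are expected to be supported by roughly equal numbers of insertions, so a strong imbalance between them hints at introgression. The abstract does not state which statistic the authors use; the exact two-sided binomial test below, the function names, and the input layout are all assumptions made for illustration.

```python
from math import comb


def binomial_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial p-value.

    Sums the probabilities of all outcomes that are no more likely than
    the observed count k out of n trials.
    """
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    threshold = pmf[k] * (1 + 1e-9)  # small tolerance for float ties
    return min(1.0, sum(x for x in pmf if x <= threshold))


def quartet_asymmetry(counts):
    """Test whether the two minority quartet topologies are equally supported.

    counts: dict mapping each of the three unrooted quartet topologies
    (hypothetical labels such as "AB|CD") to its number of supporting
    retroelement insertions.
    """
    if len(counts) != 3:
        raise ValueError("expected counts for exactly three topologies")
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    (_, _major), (minor1, c1), (minor2, c2) = ranked
    n = c1 + c2
    p_value = binomial_two_sided_p(c1, n) if n > 0 else 1.0
    return {
        "minor_topologies": (minor1, minor2),
        "minor_counts": (c1, c2),
        "p_value": p_value,
    }


if __name__ == "__main__":
    balanced = {"AB|CD": 40, "AC|BD": 10, "AD|BC": 10}
    skewed = {"AB|CD": 40, "AC|BD": 15, "AD|BC": 0}
    print(quartet_asymmetry(balanced)["p_value"])
    print(quartet_asymmetry(skewed)["p_value"])
```

When many quartets are tested on one data set, as in the Balaenopteroidea analysis, a multiple-testing correction would normally be applied to the resulting p-values; that step is omitted from this sketch.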

Subjects

Genetic Speciation, Genetic Models, Retroelements, Vertebrates/genetics, Animals, DNA Transposable Elements, Genetic Hybridization, Phylogeny
